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Abstract 

Background 

We  can  describe  protein-protein  interactions  (PPIs)  as  sets  of  distinct  domain-domain 
interactions  (DDIs)  that  mediate  the  physical  interactions  between  proteins.  Experimental 
data  confirm  that  DDIs  are  more  consistent  than  their  corresponding  PPIs,  lending  support  to 
the  notion  that  analyses  of  DDIs  may  improve  our  understanding  of  PPIs  and  lead  to  further 
insights  into  cellular  function,  disease,  and  evolution.  However,  currently  available 
experimental  DDI  data  cover  only  a  small  fraction  of  all  existing  PPIs  and,  in  the  absence  of 
structural  data,  determining  which  particular  DDI  mediates  any  given  PPI  is  a  challenge. 

Results 

We  present  two  contributions  to  the  field  of  domain  interaction  analysis.  First,  we  introduce  a 
novel  computational  strategy  to  merge  domain  annotation  data  from  multiple  databases.  We 
show  that  when  we  merged  yeast  domain  annotations  from  six  annotation  databases  we 
increased  the  average  number  of  domains  per  protein  from  1.05  to  2.44,  bringing  it  closer  to 
the  estimated  average  value  of  3.  Second,  we  introduce  a  novel  computational  method, 
parameter-dependent  DDI  selection  (PADDS),  which,  given  a  set  of  PPIs,  extracts  a  small  set 
of  domain  pairs  that  can  reconstruct  the  original  set  of  protein  interactions,  while  attempting 
to  minimize  false  positives.  Based  on  a  set  of  PPIs  from  multiple  organisms,  our  method 
extracted  27%  more  experimentally  detected  DDIs  than  existing  computational  approaches. 

Conclusions 

We  have  provided  a  method  to  merge  domain  annotation  data  from  multiple  sources, 
ensuring  large  and  consistent  domain  annotation  for  any  given  organism.  Moreover,  we 
provided  a  method  to  extract  a  small  set  of  DDIs  from  the  underlying  set  of  PPIs  and  we 


showed  that,  in  contrast  to  existing  approaches,  our  method  was  not  biased  towards  DDIs 
with  low  or  high  occurrence  counts.  Finally,  we  used  these  two  methods  to  highlight  the 
influence  of  the  underlying  annotation  density  on  the  characteristics  of  extracted  DDIs. 
Although  increased  annotations  greatly  expanded  the  possible  DDIs,  the  lack  of  knowledge 
of  the  true  biological  false  positive  interactions  still  prevents  an  unambiguous  assignment  of 
domain  interactions  responsible  for  all  protein  network  interactions. 

Executable  files  and  examples  are  given  at:  http://www.bhsai.org/downloads/padds/ 
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Background 

The  living  cell  is  a  dynamic,  interconnected  system  where  proteins  interact  with  each  other  to 
facilitate  biological  processes.  Large  protein-protein  interaction  (PPI)  datasets  have  become 
available  due  to  advances  in  experimental  biology  and  the  development  of  high-throughput 
screening  techniques.  However,  while  existing  data  describe  thousands  of  protein 
interactions,  such  interactions  still  constitute  only  a  fraction  of  all  PPIs  for  a  small  number  of 
available  organisms  [1-5].  Moreover,  available  PPI  datasets  acquired  from  different 
experiments  are  often  seemingly  inconsistent  with  each  other,  implying  that  the  different 
methods  might  produce  false  positive  interactions  or  fail  to  identify  certain  types  of 
interactions  [4,6-9].  Here,  we  attempt  to  address  this  seemingly  intractable  problem  by 
focusing  on  bioinformatics  approaches  that  use  protein  domains  as  fundamental  building 
blocks  of  protein  interactions. 

Domains  as  protein  interaction  building  blocks 

Proteins  consist  of  one  or  more  domains  and  multiple  studies  have  shown  that  domain- 
domain  interactions  (DDIs)  from  different  experiments  are  more  consistent  than  their 
corresponding  PPIs,  suggesting  that  domains  may  be  fundamental  in  mediating  physical 
interactions  between  proteins  [10-12].  Under  the  assumption  that  protein  interactions  are 
mediated  by  domain  interactions,  we  can  hypothesize  that  each  interaction  in  a  PPI  dataset 
can  be  converted  into  a  corresponding  set  of  pairwise  domain  interactions.  However,  lack  of 
direct  experimental  evidence  for  interactions  at  the  domain  level  means  that  we  can  only 
account  for,  or  explain,  a  small  fraction  of  known  PPIs  for  any  organism  using 
experimentally  determined  DDIs.  Determining  the  particular  domains  that  physically  bind 
(i i.e .,  mediate)  a  given  PPI  based  on  limited  structural  information  remains  a  challenge. 

To  address  this  challenge,  we  must  first  characterize  the  specific  protein  domains  that 
mediate  protein  interactions.  It  is  estimated  that  approximately  80%  of  eukaryotic  proteins 
and  67%  of  prokaryotic  proteins  have  multiple  domains  [13,14].  Most  annotation  databases 
characterize  each  domain  family  using  a  small,  curated  set  of  amino  acid  sequences  common 
to  representative  members.  These  databases  share  a  significant  amount  of  protein-domain 
annotation  data;  however,  each  database  also  contains  a  noteworthy  number  of  unique 
protein-domain  annotations.  Some  databases,  e.g..  Conserved  Domain  Database  (CDD)  [15] 
and  InterPro  [16,17],  provide  protein-domain  annotation  information  collected  from  several 


databases  but  none  provides  the  capability  to  methodically  merge  these  annotations  (Figure 
1A). 


Figure  1  Evaluation  of  different  protein-domain  annotation  merging  strategies.  (A) 

Using  the  InterPro  database,  we  obtained  seven  protein-domain  annotations  for  yeast  protein 
YNL271C  from  three  databases:  PFAM  [32],  Superfamily  (SF)  [33],  and  SMART  [34,35]. 
PFAM  domains:  FH2,  Drf _FH3,  and  two  Drf_GBD  domains;  SF  domains:  Formin  homology 
2  domain  (FH2  domain)  and  ARM  repeat ;  and  SMART  domain:  Formin  Homology.  (B)  The 
naive  domain-merging  strategy  identified  seven  unique  domains  for  YNL271C.  (C)  Sequence 
locations  helped  identify  some  of  the  identical  domains  ( FH2 ,  FH2  domain,  and  Formin 
Homology )  but  was  not  able  to  differentiate  between  different  domains  that  share  the  same 
sequence  position.  (D)  Taking  into  consideration  both  sequence  location  and  domain 
names/labels,  our  merging  strategy  identified  four  unique  domains:  ARM  repeat,  Drf_FH3, 
Drf_GBD,  and  a  domain  consisting  of  FH2  domains  ( FH2 ,  FH2  domain,  and  Formin 
Homology). _ 


Combining  data  from  multiple  databases,  while  addressing  annotation  inconsistencies,  is  a 
non-trivial  procedure.  For  example,  a  naive  domain  annotation  data-merging  strategy 
consisting  of  the  aggregation  of  all  annotation  data  regardless  of  domain  sequence  overlaps  or 
domain  name/label  similarities  would  increase  the  average  number  of  hypothetical  domains 
per  protein.  However,  this  strategy  would  also  overestimate  the  total  number  of  domains, 
because  it  considers  domains  that  are  not  identically  represented  in  two  different  databases  as 
two  different  domains  (Figure  IB).  In  contrast,  considering  sequence  information  as  part  of 
the  naive  merging  strategy,  e.g .,  by  aggregating  all  annotation  data  that  overlap  in  at  least  10 
continuous  amino  acids,  would  reduce  the  number  of  inferred  domains  per  protein.  However, 
such  a  merging  strategy  inherently  assumes  that  all  domains  that  overlap  in  sequence  are 
identical,  leading  to  a  small  number  of  merged  domains  and,  likely,  an  underestimation  of  the 
total  number  of  true  domains  (Figure  1C).  The  strategy  presented  here  combines  sequence 
locations  and  name/label  information  to  construct  merged  domain  annotation  sets  in  which 
the  number  of  domains  per  protein  is  not  a  priori  over-  or  underestimated  (Figure  ID). 

Domain-based  methods  for  reconstituting  whole  protein  interaction  networks 

The  use  of  domains  as  mediators  of  protein  interactions  requires  the  ability  to  assign  domains 
to  all  proteins  under  consideration.  However,  in  the  case  of  multi-domain  proteins,  it  is 
unclear  which  particular  domains  truly  mediate  a  given  PPI  set,  because  more  than  one 
potential  domain  pair  can  account  for  a  single  interaction.  This  uncertainty  could  lead  to 
predictions  of  false  positive  PPIs,  as  domains  identified  as  mediators  of  protein  interactions 
account  not  only  for  the  original  PPI  set  but  also  for  all  other  protein  pairs  that  contain  the 
same  domain  pair  combinations.  Existing  computational  methods  use  varying  approaches  to 
tackle  different  aspects  of  these  problems,  each  with  its  own  set  of  aims,  strengths,  and 
limitations  [18-30].  For  example,  some  methods  use  additional  biological  information,  such 
as  gene  expression  data,  to  establish  whether  a  PPI  can  occur  [24,25,29],  and  others  limit  the 
PPI  coverage  to  smaller  sets  of  high-confidence  interactions  [19,26,27].  An  additional 
promising  approach  is  to  use  a  feature  selection  algorithm  to  find  a  set  of  DDIs  that  best 
discriminate  between  true  and  false  PPIs  [30].  However,  these  methods  are  not  broadly 
applicable  to  non-model  organisms  or  comprehensive  enough  to  include  protein  interactions 
on  a  proteomic  scale. 


In  this  regard,  reconstitution  methods  provide  a  framework  that  does  not  a  priori  require 
additional  data  and  is  applicable  on  a  genomic  scale  to  any  organism  provided  a  PPI  dataset 
exists  [20,21,23].  The  aim  of  these  methods  is  to  identify  small  sets  of  potential  DDIs  that 
reconstitute  the  complete  original  set  of  PPIs.  Overall,  the  aim  of  the  maximum- specificity 
set  cover  (MSSC)  method  [23]  is  to  minimize  the  number  of  potential  false  positive 
interactions  regardless  of  the  number  of  DDIs  used  to  explain  the  PPI  set,  while  the  aim  of 
the  parsimonious  approach  (PA)  [20]  and  the  generalized  parsimonious  explanation  (GPE) 
method  [21]  is  to  minimize  the  number  of  selected  DDIs  regardless  of  the  introduction  of 
false  positive  interactions.  Despite  their  underlying  differences,  all  three  approaches  (MSSC, 
PA,  and  GPE)  have  been  shown  to  recover  DDIs  experimentally  identified  from  structural 
data.  This  leads  to  the  observation  that  true  DDIs  are  not  necessarily  rare,  promiscuous,  or 
parsimonious,  but  rather  are  distributed  between  the  extremes.  Consequently,  a  method  that 
reconstitutes  protein  interactions  based  on  different  degrees  of  rare,  promiscuous,  and 
parsimonious  DDIs  could  prove  beneficial. 

Our  contributions 

Here,  we  investigate  how  to  create  merged  sets  of  domain  annotations  and  how  to  use  these 
annotations  to  select  sets  of  DDIs  that  reconstitute  large-scale  PPI  networks  using  different 
true  positive  and  false  positive  selection  weights.  First,  we  introduce  a  novel  computational 
strategy  to  merge  protein-domain  annotation  data  from  multiple  databases,  a  needed 
capability  that  is  not  currently  available  elsewhere.  We  believe  that  merging  protein-domain 
annotation  data  from  multiple  sources  will  help  ensure  a  large  and  consistent  domain 
annotation  set  for  any  given  organism.  Second,  we  introduce  a  novel  heuristic  computational 
approach,  parameter-dependent  DDI  selection  (PADDS),  which,  given  a  set  of  PPIs,  extracts 
a  small  set  of  DDIs  that  explains  the  original  set  of  protein  interactions  and  is  not  biased 
towards  DDIs  with  either  low  or  high  occurrence  counts.  The  heuristic  scoring  system  for 
selecting  DDIs  can  be  tuned  between  favoring  known  interactions  (true  positives)  and 
penalizing  non-observed  interactions  (false  positives).  Given  that  the  domain-merging 
procedure  increases  the  number  of  domains  per  protein  and,  hence,  the  number  of  possible 
domain  combinations,  PADDS  was  designed  to  minimize  both  the  number  of  false  positive 
PPIs  and  the  size  of  the  extracted  DDI  set. 

Results  and  discussion 

Merged  domain  annotations  from  multiple  databases 

Our  strategy  combines  sequence  locations  and  name/label  information  to  construct  merged 
domain  annotation  sets  as  detailed  in  Methods.  Here,  we  illustrate  its  application  on  a  well- 
annotated  single-cell  organism.  We  created  a  merged  set  of  protein-domain  annotations  for 
yeast  ( Saccharomyces  cerevisiae )  using  sequences  of  5,884  proteins,3  downloaded  from  the 
Saccharomyces  Genome  Database  (SGD)  [31]  and  yeast  annotation  data  from  six  commonly 
used  annotation  databases:  PFAM-A  (release  25.0)  [32],  Superfamily  (SF)  [33],  SMART 
[34,35],  PRODOM  [36],  TIGRFAM  [37],  and  CDD  [15].  To  assign  protein-domain 
annotations,  we  either  used  curated  yeast  domain  annotations  (if  available)  [32,33]  or 
extracted  domain  annotations  based  on  an  E-value  threshold  of  <  10'  [15,34-37].  Although 
approximately  80%  of  the  proteins  had  at  least  one  domain  annotation  in  one  of  the  databases 
(Table  1),  this  level  of  annotation  density  cannot  be  expected  for  less-studied  organisms. 


Thus,  merging  protein-domain  annotation  data  from  multiple  sources  will  help  ensure  a 
maximally  large  and  consistent  domain  annotation  set  for  any  given  organism. 


Table  1  Yeast  protein-domain  annotation  data  from  six  publicly  available  annotation 
databases 


Database 

NP 

n 

% 

Ns 

n 

% 

No 

Nu 

Ad 

PFAM-A 

4,709 

80.0 

1,174,333 

40.2 

2,595 

2,553 

mm 

SF 

3,651 

62.1 

962,602 

33.0 

1,355 

1,307 

SMART 

3,023 

51.4 

455,523 

15.6 

392 

379 

IS 

PRODOM 

146 

2.5 

19,760 

0.7 

111 

111 

m 

TIGRFAM 

3,019 

51.3 

546,226 

18.7 

2,544 

1,944 

1.25 

CDD 

2,210 

37.6 

560,299 

19.2 

3,300 

731 

A  total  of  5,884  proteins  containing  a  total  of  2,921,809  amino  acids  were  downloaded  from 
the  Saccharomyces  Genome  Database  [31].  Np,  proteins  with  at  least  one  domain  annotation. 
Ns,  protein-domain  amino  acid  sequence  coverage.  No,  number  of  unique  domains  in  the 
original  database.  Nu,  number  of  unique  domains  in  the  unified  database.  Ad,  average  number 
of  domains  per  protein  in  the  unified  database. 

Table  1  shows  that,  despite  extensive  annotation  efforts,  each  database  characterized  each 
protein  by  a  small  average  number  of  domains.  It  also  shows  the  variation  in  the  number  of 
domains  extracted  among  the  different  databases,  as  well  as  the  variation  in  the  number  of 
proteins  with  domain  annotations. 

The  content  of  the  final  merged  domain  annotation  set  does  not  depend  on  the  order  in  which 
we  merged  the  databases.  However,  to  create  high-confidence  merged  annotation  sets  of 
different  sizes,  e.g.,  merged  annotation  from  two,  three,  ...,  six,  databases,  we  first  merged 
the  PFAM-A  and  SF  contents  because  they  contain  curated  domains  of  high  confidence.  We 
selected  the  merging  order  of  the  other  four  databases  randomly  and  Table  2  shows  the 
database  origins  of  the  six  merged  sets,  SET-1  to  SET-6. 

Table  2  Database  origin  of  merged  domain  annotation  sets _ 

Annotation  set  Domain  annotation  databases 

SET-1  PFAM-A  [32]  — 

SET-2  PFAM-A,  SF  [33] 

SET-3  PFAM-A,  SF,  SMART  [34,35] 

SET-4  PFAM-A,  SF,  SMART,  PRODOM  [36] 

SET-5  PFAM-A,  SF,  SMART,  PRODOM,  TIGRFAM  [37] 

SET-6  PFAM-A,  SF,  SMART,  PRODOM,  TIGRFAM,  CDD  [15] 


Table  3  shows  that  the  merging  procedure  increased  the  number  of  proteins  with  domain 
annotation  by  more  than  10%.  At  the  same  time,  the  average  number  of  domains  per  protein 
increased  from  1.05  to  2.44  (Table  3),  approaching  the  estimated  average  value  of  ~3  [10,14]. 
The  final  domain  annotation  set  created  using  the  database  merging  procedure  consisted  of 
4,114  unique  domains  (Additional  file  1).  The  domain  length  distribution  in  this  set  was 
similar  to  the  domain  length  distribution  from  each  of  the  six  original  databases  (data  not 
shown),  and  most  domains  ranged  in  length  between  100  and  300  amino  acids. 


Table  3  Yeast  protein-domain  annotation  data  after  merging  annotations  from  the  six 

databases _ 

Domain  annotation  set  Nu  Np  Ns  AD 


n  %  n  % 

Domain-merging  procedure 


SET-1 

2,595 

4,709 

80.0 

1,174,333 

40.0 

1.05 

SET-2 

2,847 

4,964 

84.4 

1,510,026 

51.7 

1.33 

SET-3 

2,806 

5,280 

89.7 

1,653,122 

56.6 

1.69 

SET-4 

2,843 

5,307 

90.2 

1,663,269 

56.9 

1.69 

SET-5 

4,182 

5,392 

91.6 

1,735,533 

59.4 

2.55 

SET-6 

4,114  5,395  91.7 

Naive  domain  merging 

1,756,481 

60.1 

2.44 

SET-6-NB 

10,297  5,395  91.7  1,756,481 

Domain  merging  based  solely  on  sequence  overlap 

60.1 

5.77 

SET-6-SB 

1,492 

5,395 

91.7 

1,756,481 

60.1 

1.32 

Database  sets  SET-1  through  SET-6  are  defined  in  Table  2.  SET-6-NB  (naive  merging) 
contained  the  union  of  unique  domain  annotations  from  the  six  databases  used  in  SET-6. 
SET-6-SB  contained  merged  domain  annotations  from  the  same  six  databases  as  in  SET-6, 
but  domains  in  this  set  were  merged  only  if  their  sequences  overlapped  and  they  shared  at 
least  ten  common  amino  acids  (i.e.,  domain  labels  were  not  considered).  Nu,  number  of 
unique  domains.  Np ,  proteins  with  at  least  one  domain  annotation.  Ns,  protein-domain  amino 
acid  sequence  coverage.  AD,  average  number  of  domains  per  protein. 

Evaluation  of  the  protein-domain  annotation  merging  strategy 

To  evaluate  the  merged  domains,  we  compared  our  results  to  those  obtained  with  two  simple 
alternative  strategies:  a  naive  domain-merging  strategy  (SET-6-NB)  and  a  naive  domain¬ 
merging  strategy  that  takes  into  account  sequence  overlaps  (SET-6-SB).  Because  the  number 
of  original  domains  is  constant,  all  three  merged  sets  (SET-6,  SET-6-NB,  and  SET-6-SB) 
yielded  the  same  number  of  proteins  with  domain  annotation.  However,  their  final  domain 
annotations  resulted  in  different  numbers  of  unique  domains,  as  well  as  different  average 
numbers  of  domains  per  protein  (Table  3).  SET-6-NB  consisted  of  over  10,000  unique 
domains,  with  an  average  number  of  5.77  domains  per  protein.  This  set  considerably 
overestimated  the  total  number  of  unique  domains,  as  many  of  its  10,000  domains 
represented  the  same  domain  with  a  slightly  different  label.  For  example,  the  naive  merging 
strategy  would  consider  the  formin  homology  2  domain  represented  in  three  annotation 
databases  (PFAM-A,  SF,  and  SMART)  as  different  domains,  because  their  domain  labels  and 
sequence  locations  are  not  identical  (see  Figure  IB).  By  merging  annotations  that  overlap  in 
at  least  10  continuous  amino  acids,  SET-6-SB  reduced  the  number  of  unique  domains  to 
1,492,  as  well  as  the  average  number  of  domains  per  protein  to  1.32.  Although  the  average 
number  of  domains  per  protein  was  greater  than  the  average  number  for  any  of  the  original 
databases,  the  total  number  of  unique  domains  was  underestimated.  For  example,  the 
sequence  location  of  ARM  repeat  overlaps  with  the  sequence  location  of  Drf_FH3  and 
Drf_GBD  domains  (see  Figure  1C)  and  this  strategy  would  merge  the  ARM  repeat  with 
Drf_FH3  and  also  with  DrfjGBD.  This  would  result  in  a  merged  domain  that  consists  of  the 
three  original  domains,  ARM  repeat,  Drf_FH3,  and  DrfjGBD,  even  though  these  three 
domains  are  different  and  should  not  have  been  merged.  Our  merging  strategy  does  not  suffer 


from  these  issues,  as  it  distinguishes  between  the  same  and  different  domains  that  cover  the 
same  sequence  location  based  on  their  domain  labels  (see  Figure  ID). 

These  results  showed  that  our  protein-domain-merging  strategy  did  not  overestimate  or 
underestimate  the  number  of  domains  per  protein.  However,  this  does  not  necessarily  imply 
that  the  merged  domain  annotation  is  biologically  more  relevant.  To  this  end,  we  compared 
our  merged  protein-domain  annotations  to  the  recently  released  high-confidence  annotations 
from  the  PFAM-A  database  (PFAM  release  26.0).  To  assess  the  amount  of  correctly  retrieved 
annotations  from  our  merged  set,  we  compared  them  to  the  following  two  independent 
subsets  of  the  new  PFAM  release:  1)  a  set  of  new  domain  annotations  that  replaced 
annotations  from  the  previous  PFAM  release  (PFAM  release  25.0)  and  2)  a  set  of  new 
domain  annotations  that  did  not  exist  in  the  previous  PFAM  release  but  have  a  corresponding 
annotation  in  the  merged  dataset.b  The  comparison  procedure  consisted  of  two  steps.  First,  for 
each  new  domain  annotation,  we  found  one  or  more  merged  domain  annotations  that  covered 
the  same  protein  sequence  location.  Then,  we  manually  compared  domain  labels  and 
descriptions  between  the  new  domain  and  the  merged  domains.  Out  of  17  new  domain 
annotations  in  the  first  subset  and  274  new  domain  annotations  in  the  second  subset,  we 
found  13  (76%)  and  202  (71%)  annotations,  respectively,  in  our  merged  dataset  (Additional 
file  2).  Because  these  account  for  >70%  of  the  new  PFAM-A  annotations,  it  demonstrates  the 
benefits  of  the  proposed  domain-merging  strategy. 

Use  of  annotation-based  domains  to  reconstitute  protein  interaction  networks 

The  introduction  of  a  more  complete  set  of  domain  annotations  across  all  interacting  proteins 
in  a  genome  would  allow  for  the  enumeration  of  all  domain  interactions  that  could  account 
for  an  original  set  of  PPIs.  Furthermore,  this  would  also  allow  for  a  comprehensive  evaluation 
of  DDIs  and  identification  of  an  optimum  DDI  set.  However,  this  process  has  the 
disadvantage  of  exponentially  increasing  the  number  of  domain  combinations.  To  circumvent 
this  problem,  our  PADDS  method  enumerates  only  a  subset  of  DDI  combinations  and 
evaluates  each  one  of  them  based  on  the  following  two  criteria:  1)  the  number  of  DDIs  used 
to  account  for  the  observed  PPIs  and  2)  the  number  of  non-observed  PPIs  (i.e.,  false 
positives)  introduced  by  the  combination  of  DDIs.  As  detailed  in  Methods,  this  selection 
depends  on  the  value  of  the  parameter  that  specifies  the  true/false  positive  biases,  denoted  as 
a\  a  e  [0.0,  1.0\,  where  an  a  of  0.0  favors  observed  protein  interactions  and  an  a  of  1.0 
maximally  penalizes  non-observed  interactions.  We  first  used  PADDS  to  investigate  the 
choice  of  selecting  different  values  of  the  parameter  a  on  retrieved  DDIs.  Here,  the  DDIs 
were  constructed  from  a  study  containing  multiple  organisms,  but  with  protein-domain 
annotations  from  a  single  database.  The  PADDS-extracted  DDIs  were  compared  to  other 
methods  and  validated  using  the  iPFAM  [38]  and  DOMINE  [39,40]  databases  of  known  and 
predicted  DDIs.  We  then  applied  the  algorithm  to  extract  DDIs  from  a  high-confidence  yeast 
PPI  dataset  using  merged  domain  annotations.  We  compared  the  results  from  our  analysis  to 
those  of  existing  reconstitution  methods  on  the  same  datasets. 

Multiple  organism  PPIs  characterized  by  a  single  domain  annotation  database 

To  determine  the  consequences  of  favoring  true  positives  or  penalizing  false  positives,  we 
examined  the  ability  of  PADDS  to  generate  different  sets  of  DDIs  that  can  reconstitute  a 
diverse  set  of  PPIs  from  multiple  organisms  for  different  values  of  a.  We  applied  PADDS  to 
a  collection  of  PPIs  from  68  different  organisms  as  assembled  by  Riley  et  al.  [27].  In  order  to 
compare  our  results  on  this  dataset  to  the  GPE  method,  previously  identified  as  giving  the 


best  reconstitution  results  on  this  dataset  [21],  we  converted  all  domains  to  the  same  PFAM- 
A  supra-domain  annotations  used  by  GPE  [21].  We  identified  10,025  proteins  with  PFAM-A 
supra-domain  annotations  and  20,625  PPIs  where  both  interacting  proteins  had  at  least  one 
domain  annotation.  This  dataset  yielded  a  total  number  of  26,113  potential  DDIs  that  could 
be  used  to  reconstitute  all  PPIs  and  the  average  number  of  domains  per  protein  for  this  dataset 
was  1.37. 

For  each  a  used  in  PADDS  to  extract  the  DDI  sets  (Additional  file  3),  we  ranked  the  DDIs 
based  on  their  corresponding  benefit  values  (see  Methods).  We  evaluated  each  set  of  top¬ 
scoring  DDIs  for  enrichment  of  DDIs  detected  in  crystal  structures  available  in  the  iPFAM 
database  (denoted  as  “known  DDIs”)  [38].  Out  of  26,113  potential  DDIs  from  the  Riley 
dataset,  691  DDIs  were  present  in  the  set  of  known  DDIs  [20].  Figure  2 A  shows  the  fraction 
of  known  DDIs  retrieved  for  different  values  of  a  in  different  top-ranked  DDI  sets.  The 
overall  number  of  extracted  known  DDIs  did  not  increase  linearly  with  the  number  of  DDIs 
analyzed,  and  the  total  retrievable  number  was  less  than  70%  of  the  known  set.  Additionally, 
the  number  of  known  DDIs  retrieved  varied  in  a  non-linear  fashion  with  a,  indicating  that  the 
extraction  procedure  was  sensitive  to  the  selection  weights  for  both  observed  and  non¬ 
observed  interactions.  These  observations  imply  a  non-trivial  solution  to  the  optimal  DDI 
extraction  problem.  We  also  noted  that  the  largest  number  of  known  DDIs  were  always 
retrieved  in  sets  for  which  a  was  not  at  its  extreme  values  of  0.0  or  1.0.  For  the  small  to 
intermediate  size  sets  between  1,000  to  4,000  analyzed  DDIs,  the  maximum  retrievable 
number  occurred  at  a  values  -0.10. 


Figure  2  Enrichment  of  “known”  (iPFAM)  domain-domain  interactions.  Evaluation  of 
the  top-scoring  domain-domain  interactions  (DDIs)  extracted  by  the  parameter-dependent 
DDI  selection  (PADDS)  and  the  generalized  parsimonious  explanation  (GPE).  (A)  The 
fraction  of  known  DDIs  in  the  iPFAM  database  [38]  retrieved  by  PADDS  as  a  function  of  a 
and  the  number  of  top-scoring  DDIs.  (B)  Comparison  of  the  percentage  of  retrieved  iPFAM 
DDIs  using  PADDS  and  GPE  as  a  function  of  top-ranked  DDI  sets  ( i.e .,  recall).  (C) 
Comparison  of  the  fraction  of  retrieved  iPFAM  DDIs  using  PADDS  and  GPE  as  a  function  of 
the  iPFAM  DDI  set  and  top-ranked  DDI  sets  (i.e.,  precision).  For  the  GPE  sets,  we  used  the 
DDI  rank  information  provided  with  the  published  data  that  includes  their  designated  high- 
confidence  (GPE-HC)  and  low-confidence  (GPE-LC)  sets  [21].  We  have  also  indicated  the 
best  results  achievable  with  any  a  value,  typically  achieved  for  a  =  0.1. 


Figure  2B,  Figure  2C,  and  in  Additional  file  4:  Table  SI  show  the  difference  in  retrieving 
known  DDIs  between  PADDS  and  the  published  results  using  GPE  methods.  For  PADDS, 
we  show  both  the  best  results  using  selected  a  values  and  average  results  using  non-extreme 
values  of  a.  For  this  dataset,  PADDS  was  more  successful  (13%  -  27%)  than  the  best  GPE 
method  in  the  majority  of  the  a  selections  away  from  the  extreme  values.  This  implies  that 
the  ability  to  modulate  the  preference  for  known  interactions  and  tolerance  of  non-observed 
interactions  was  an  important  factor  in  the  process  of  DDI  extraction  and  the  overall  ability  to 
extract  known  DDIs.  While  there  is  always  a  dataset  dependency  on  these  results,  it  was  also 
clear  that  relaxing  either  extreme  selection  ( a  =  0.0  or  a  =  1.0 )  retrieved  more  known  DDIs 
(Figure  2A). 

Although  DDI  extraction  can  be  optimized  for  each  dataset  by  varying  a,  one  cannot  in  all 
cases  independently  determine  an  optimal  a  value.  Hence,  we  were  also  interested  in  the 
robustness  of  the  algorithm  and,  in  particular,  evaluating  extracted  DDIs  that  are  independent 
of  a.  We  used  the  DOMINE  database  as  a  comprehensive  source  of  known  and  predicted 


DDIs  derived  from  multiple  sources  [39,40]  to  construct  DDIs  (Additional  file  4:  Validation 
of  extracted  core  DDIs  section  and  Additional  file  4:  Figure  SI).  The  analysis  showed  that 
there  was  a  large  overlap  among  the  sets  of  extracted  DDIs  for  different  values  of  a, 
indicating  robustness  of  the  algorithm  to  choices  of  a.  Furthermore,  the  PADDS  algorithm 
was  capable  of  providing  parameter-independent  and  unique  DDI  predictions  not  derivable 
from  high-confidence  results  of  other  computational  procedures.  To  further  characterize 
PADDS-extracted  DDIs,  we  next  examined  the  high-confidence  protein  interaction  network 
from  a  single  organism  (yeast)  with  our  merged  domain  annotations. 

Single  organism  PPIs  characterized  by  multiple  annotation  databases 

To  evaluate  the  influence  of  the  underlying  set  of  PPIs  and  protein-domain  annotation  data  on 
the  DDI  extraction  process,  we  reconstructed  a  set  of  high-confidence  yeast  PPI  data  created 
by  the  Interaction  Detection  Based  On  Shuffling  (IDBOS)  procedure  at  a  5%  false  discovery 
rate  [8,41].  We  have  previously  shown  that  this  dataset  identified  binary  interactions  as  well 
as,  or  better  than,  the  high-confidence  consolidated  yeast  two-hybrid  set  or  other  high- 
confidence  datasets  based  on  affinity  purification  followed  by  mass  spectrometry  [8,41].  The 
IDBOS  dataset  consists  of  8,401  PPIs  between  1,295  proteins.  For  protein-domain  annotation 
of  the  IDBOS  dataset,  we  used  our  merged  protein-domain  annotation  data  (SET-1  to  SET-6) 
as  described  above.  The  average  number  of  domains  per  protein  for  the  IDBOS  dataset  was 
2.69.  In  Additional  file  4:  Table  S2  shows  the  complete  statistics  for  the  domain  annotations 
in  the  IDBOS  dataset. 

Evaluating  domain  interactions  for  high-confidence  yeast  protein  interactions 

We  evaluated  the  merged  domain  annotation  sets  using  three  reconstitution  methods: 
PADDS,  MSSC,  and  GPE.  We  used  PADDS  with  parameter  a  e  [0.0,  1.0 ]  in  0.1  increments, 
ranked  the  extracted  DDIs  based  on  the  corresponding  benefit  value,  and  extracted  the 
corresponding  ranked  data  for  MSSC  and  GPE  (see  Methods).  Although,  by  construction,  all 
obtained  DDI  sets  accounted  for  all  original  PPIs,  different  methods  yielded  DDI  sets  of 
different  sizes  for  each  of  the  six  domain  annotation  schemes,  with  PADDS  consistently 
extracting  the  smallest  sets  of  DDIs.  Additional  file  4:  Figure  S2,  Additional  file  4:  Table  S3, 
and  Additional  file  4:  Table  S4  in  Additional  file  4  provide  the  complete  results  of  this 
analysis.  However,  despite  of  their  aim  to  minimize  the  number  of  false  positives,  all  three 
methods  identified  a  much  larger  number  of  novel  (predicted)  PPIs  than  what  could  be 
expected  to  occur  in  a  living  cell  [1,2,4, 5].  Even  if  we  assume  that  all  predicted  interactions 
represent  plausible  physical  interactions  between  proteins,  e.g.,  a  specified  PPI  would  occur  if 
two  proteins  were  in  close  proximity,  it  is  likely  that  in  their  native  environment  they  are 
under  additional  biological  regulation.  Thus,  one  cannot  assume  that  all  proteins  that  contain 
interacting  domains  will  necessarily  interact  within  the  cell,  due  to  the  existence  of  alternative 
regulatory  mechanisms  that  control  these  interactions  [42] . 

To  evaluate  the  performance  of  the  different  reconstitution  methods  on  different  domain 
annotation  sets,  we  investigated  the  ability  of  each  method  to  extract  DDIs  that  accounted  for 
the  given  PPIs  while  limiting  the  number  of  false  positive  PPIs.  For  this  calculation,  we 
defined  the  set  of  true  non-interacting  protein  pairs  as  the  set  of  all  pairwise  protein 
interactions  minus  the  known  true  interaction  set  [18,20,30],  see  Methods.  Based  on  these 
definitions,  we  could  then  ascertain  true  and  false  PPI  predictions  for  each  extracted  set  of 
DDIs  and  construct  the  corresponding  Receiver  Operating  Characteristic  (ROC)  curves  from 
an  analysis  of  true  positive  and  false  positive  rates.  PADDS  outperformed  the  other  two 


methods  for  all  six  annotation  sets  (Additional  file  4:  Figure  S3  and  Additional  file  4:  Figure 
S4).  The  largest  differences  were  most  evident  for  the  larger  annotation  sets,  e.g.,  SET-6, 
where  the  other  methods  lack  PADDS’s  flexibility  to  extract  a  small  number  of  DDIs  while 
limiting  the  introduction  of  non-observed  interaction. 

PADDS  increases  diversity  of  DDIs  when  provided  with  sufficient  amounts  of 
annotations 

To  investigate  the  relationship  between  the  size  of  the  domain  annotation  sets  and  the 
obtained  results,  we  compared  the  set  of  DDIs  (accounting  for  the  IDBOS  set  of  PPIs) 
extracted  by  PADDS  for  different  values  of  a.  We  found  that,  for  SET-1,  approximately  80% 
of  the  DDIs  were  represented  in  all  extracted  sets  and  were  not  dependent  on  the  particular 
value  of  a  (a  similar  result  was  observed  in  the  multi-organism  study).  Figure  3  shows  that, 
with  an  increasing  amount  of  domain  annotation  data,  the  number  of  DDIs  represented  in  all 
extracted  sets  decreased,  and  for  SET-6  only  -30%  of  the  DDIs  were  represented  in  all  sets. 
In  contrast,  we  observed  an  increased  percentage  of  DDIs  represented  by  a  single  value  of  a 
with  larger  annotation  sets,  implying  that  this  parameter  introduced  significant  variations 
among  the  extracted  DDI  sets  when  more  domain  annotation  data  were  available.  These 
observations  suggest  that,  for  limited  amounts  of  domain  annotation  data,  computational 
methods  are  forced  to  select  particular  DDIs,  as  these  DDIs  are  the  only  ones  that  could 
account  for  certain  PPIs.  Using  additional  domain  annotation  data  removed  this  bias,  as  more 
than  one  DDI  accounted  for  a  larger  number  of  PPIs. 


Figure  3  Overlap  between  extracted  domain-domain  interaction  sets  for  different  values 
of  parameter  a.  The  graph  indicate  fractional  overlaps  between  sets  of  extracted  domain- 
domain  interactions  (DDIs)  for  the  six  different  domain  annotation  schemes  defined  in  Table 
2,  for  different  sets  of  a  values.  As  the  underlying  set  of  PPIs,  we  used  a  high-confidence 
yeast  PPI  data  set  created  by  the  Interaction  Detection  Based  On  Shuffling  (IDBOS) 
procedure  at  a  5%  false  discovery  rate  [8,41]. _ 


In  summary,  PADDS  extracted  the  smallest  set  of  DDIs  for  this  extensively  annotated  high- 
confidence  network.  However,  similar  to  other  methods,  regardless  of  how  we  biased  our 
benefit  score  in  the  extraction  process  or  how  efficient  PADDS  was  in  extracting  true 
positives,  a  large  number  of  non-observed  PPIs  resulted  from  these  DDI  selections. 

Conclusions 

Proteins  consist  of  one  or  more  domains,  and  physical  interactions  between  proteins  arise 
from  interactions  between  their  specific  domains.  Given  that  there  is  more  consistency  in 
DDIs  detected  from  different  experiments  than  in  the  corresponding  PPIs,  the  hope  is  that  an 
in-depth  analysis  of  DDIs  would  improve  our  understanding  of  PPIs  and  give  us  better 
insights  into  cellular  function,  disease,  and  evolution.  However,  determining  which  particular 
DDI  mediates  any  given  PPI  is  challenging,  because  currently  available  experimental  DDI 
data  accounts  for  only  a  small  fraction  of  all  existing  PPIs.  In  this  paper,  we  present  two 
contributions  to  the  field  of  domain  interaction  analysis. 

First,  we  introduced  a  novel  computational  strategy  that  systematically  merged  domain 
annotation  data  from  multiple  databases;  a  needed  capability  that  is  not  currently  available 
elsewhere.  By  combining  sequence  locations  with  domain  name  and  labeling  information, 


our  merging  strategy  was  less  likely  to  grossly  overestimate  or  underestimate  the  number  of 
domains  per  protein.  We  showed  that  merging  domain  annotations  from  six  different 
databases  increased  the  average  number  of  domains  per  proteins,  bringing  it  closer  to  the 
estimated  true  value.  We  believe  that  our  merging  strategy  can  ensure  a  large  and  consistent 
domain  annotation  set  for  any  given  organism. 

The  second  contribution  detailed  here  is  the  development  of  PADDS,  a  novel  computational 
method  that,  given  a  set  of  PPIs,  can  identify  a  small  set  of  potential  DDIs  that  account  for 
the  provided  set  of  PPIs  and  is  not  biased  towards  DDIs  with  low  or  high  occurrence  counts. 
We  showed  that  PADDS  was  more  successful  in  extracting  known  DDIs,  i.e.,  DDIs  that  have 
been  determined  experimentally  from  crystal  structures,  than  the  MSSC  method  and  the 
current  best  reconstitution  method,  GPE. 

It  was  also  noteworthy  that  the  choice  of  a  value  influences  the  number  of  known  DDIs 
retrieved.  For  the  PPI  dataset  aggregated  from  multiple  organisms  from  different  sources  and 
annotated  by  PFAM  only,  we  retrieved  the  largest  number  of  known  DDIs  for  small  a  values 
in  the  range  of  0.05-0.10.  We  interpreted  this  to  indicate  that  a  small  tolerance  of  false 
positives  in  the  PPI  reconstitution  procedure  relaxed  constraints  in  the  DDI  selection  process 
sufficiently  enough  to  gamer  additional  known  DDIs,  yet  avoiding  overwhelming  the 
solution  with  too  many  non-observed  interactions.  This  result  also  hints  that  the  hypothesis 
that  all  protein  interactions  must  strictly  be  composed  of  pairwise  domain  interactions  could 
be  relaxed.  We  further  found  that  increased  amounts  of  domain  annotation  data  increased  the 
diversity  of  DDIs  that  could  account  for  a  single  PPI.  As  a  result,  for  the  densely  annotated 
high-confidence  yeast  PPI  network,  we  found  that  less  than  30%  of  the  extracted  DDIs  were 
present  in  all  extracted  sets.  This  last  observation  indicates  that,  once  we  have  a  sufficient 
amount  of  annotation  data,  more  diverse  DDI  sets  can  be  used  to  reconstitute  PPI  sets  equally 
well.  As  currently  available  reconstitution  methods  identify  only  a  single  set  of  DDIs  that 
account  for  a  given  set  of  PPIs,  a  method  that  is  able  to  identify  multiple  DDI  sets  without  a 
priori  bias  towards  DDIs  with  either  low  or  high  occurrence  counts  is  a  needed  capability  that 
is  not  currently  available  elsewhere. 

Methods 

Domain  annotation  and  interaction  datasets 

We  used  a  number  of  available  genome-scale  annotation  databases  that  contain  domain 
information.  Each  database  collates  information  based  on  different  objectives  and  criteria. 
PFAM-A  contains  manually  curated  protein  families  and  provides  assignments  of  high- 
confidence  domain  annotations  through  family-specific  domain  gathering  thresholds  [32];  we 
used  PFAM-A  release  25.0.  SF  contains  structural  and  functional  domain  annotation,  derived 
from  the  structural  protein  domains  from  the  SCOP  (Structural  Classification  of  Protein) 
database  [33].  SMART  provides  annotation  of  signaling  domains  [34,35],  while  PRODOM 
[36]  and  TIGRFAM  [37]  provide  protein  domain  family  annotations  constructed 
automatically  by  sequence  homology.  CDD,  which  provides  functional  protein  annotations, 
also  lists  domain  annotations  using  multiple  sequence  alignment  models  for  domains  and 
proteins,  as  well  as  curated,  structural  domains  and  domains  imported  from  a  number  of  other 
protein-domain  annotation  databases  (e.g.,  PFAM,  SMART,  and  TIGRFAM)  [15]. 


Similarly,  we  used  three  databases  to  verify  the  extracted  domain  interactions:  1 )  the  iPFAM 
database  that  contains  domain-domain  interactions  obtained  from  the  PDB  structures  [38],  2) 
the  domain-domain  and  peptide-mediated  interactions  of  known  3D  structure  database 
(3DID)  [43,44],  and  3)  a  comprehensive  collection  of  known  and  predicted  DDIs  (DOMINE) 
[39,40], 

Protein  -  domain  annotation  merging  strategy 

Our  merging  strategy  combines  protein-domain  annotation  data  through  the  following  three 
consecutive  steps  (Figure  4): 


Figure  4  Protein-domain  annotation  merging  procedure.  An  illustration  of  the 
computational  procedure  used  to  merge  protein-domain  annotation  data  from  multiple 
databases  for  a  single  protein  P  (consisting  of  n  amino  acids)  and  domain  annotation  data 
from  three  databases:  DB1,  DB2,  and  DB3.  INPUT:  Protein  sequences  and  protein-domain 
annotations  from  one  or  more  databases.  PROCESSING:  The  annotation  data  were  merged  in 
three  consecutive  steps.  In  Step  I,  tandem  domains  within  each  protein  (and  for  each 
database)  were  merged  and  represented  as  a  continuous  domain  with  the  same  domain  label 
as  the  tandem  domains.  In  Step  II,  annotation  data  between  all  pairs  of  databases  were 
merged.  In  Step  III,  all  pairs  from  Step  II  were  merged  into  a  final  annotation  set.  In  this  step, 
new  domain  labels  were  assigned  to  the  sets  of  merged  domains.  OUTPUT:  The  output  of  the 
annotation  merging  procedure  consists  of  1)  a  set  of  new  (merged)  domain  labels  assigned  to 
the  protein,  2)  a  mapping  between  the  new  and  original  domain  labels,  and  3)  a  list  of 
merging  exceptions.  Based  on  these  lists,  one  may  (re)define  sets  of  labels  that  should  be 
treated  as  equivalent  or  non-equivalent  and  iterate  through  the  complete  domain  annotation 
merging  procedure  (ITERATION). 


Step  I  -  Domain  repeats  merging  procedure 

In  the  first  step,  we  merged  domain  repeats  within  each  database.  Domain  repeats 
represent  two  or  more  domains  from  the  same  domain  family  that  appear  in  tandem  [45]. 
Different  proteins  may  have  domain  repeats  that  consist  of  a  different  number  of  domains 
from  the  same  domain  family.  In  annotation  databases,  domain  repeats  are  represented 
either  as  a  set  of  domains  that  appear  in  tandem  or  as  a  single  domain  that  corresponds  to 
the  union  of  the  tandem  domains.  Our  procedure  aimed  to  represent  each  domain  repeat  as 
a  single  domain.  This  ensured  a  uniform  representation  of  all  domain  repeats  and 
ultimately  removed  inconsistencies  among  databases.  To  this  end,  for  each  database  and 
for  each  protein  of  interest,  our  method  flagged  domains  with  identical  labels  and  assigned 
them  to  a  single  domain.  The  new  domain  inherited  all  labels  of  its  members.  Furthermore, 
its  sequence  was  represented  by  a  continuous  amino  acid  sequence  containing  both  the 
member  domains  and  the  amino  acid  sequences  between  the  domains  (Figure  4,  Step  I). 
Although  tandem  domains  that  consist  of  different  numbers  of  domains  from  the  same 
family  may  have  different  functional  roles,  our  current  implementation  did  not  distinguish 
between  them.  We  chose  this  strategy  because  the  domain  annotation  data  retrieved  from 
most  annotation  databases  depend  on  an  E-value  threshold  and,  hence,  it  was  not  possible 
to  accurately  and  indisputably  determine  how  many  domains  appear  in  tandem. 
Furthermore,  the  E-value  threshold  could  also  influence  the  length  of  the  amino  acid 
sequence  between  tandem  domains  that  were  merged  together.  For  this  reason,  we  did  not 
impose  a  limit  on  the  minimum  or  maximum  sequence  length  between  tandem  domains  in 
the  merging  procedure. 

Step  II  -  Merging  annotation  data  between  pairs  of  databases 


In  the  second  step,  we  merged  annotation  data  between  each  pair  of  databases,  including 
each  database  with  itself.  This  step  ensured  that  all  possible  domain  pairs  were  considered 
and  it  removed  any  possible  effect  of  the  order  in  which  domains  and  databases  were 
merged.  In  this  step,  for  each  protein,  we  grouped  domain  annotations  into  sets  such  that 
domains  within  a  set  had  equivalent  domain  labels  and  overlapped  with  approximately  the 
same  segment  of  the  protein  sequence,  i.e.,  it  matched  at  least  ten  continuous  amino  acids. 
The  final  merged  domains  were  not  sensitive  to  the  examined  threshold  variations  ranging 
from  1-30  amino  acids.  Domain  annotations  within  each  set  were  merged  into  a  single 
domain.  The  new  domain  inherited  all  domain  labels  of  its  members,  i.e.,  it  became  a 
multi-label  domain.  Furthermore,  the  sequence  of  the  newly  defined  domain  represented  a 
continuous  amino  acid  sequence  that  consisted  of  the  union  of  all  amino  acid  sequences  of 
the  member  domains  (Figure  4,  Step  II).  By  using  a  combination  of  domain  labels  and 
domain  sequences  in  the  merging  procedure,  the  method  ensured  that  potentially  different 
domains  that  covered  approximately  the  same  segment  of  a  protein  sequence  were  not 
merged  together. 

To  determine  if  two  domain  labels  were  equivalent,  we  first  represented  each  one  of  them 
as  an  array  of  words  contained  within  each  label.  Because  domain  labels  often  contain 
general  common/trivial  words  (e.g.,  a,  the,  domain,  family,  like,  member,  of,  via,  within), 
these  words  were  excluded  from  the  domain  labels.  Next,  we  compared  all  words  from  the 
first  array  to  all  words  from  the  second  array  to  determine  whether  they  consisted  of 
identical  words  or  words  that  were  contained  within  each  other  (e.g.,  “ kinases ”  and 
“ pkinase ”).  If  such  a  pair  of  non-trivial  words  was  detected,  the  two  corresponding  domain 
labels  were  considered  equivalent.  Clearly,  this  method  of  determining  the  equivalence  of 
two  labels  does  not  guarantee  a  correct  outcome.  Therefore,  our  method  allows  users  to 
specify  pairs  of  labels  that  should  be  considered  equivalent  as  well  as  pairs  of  labels  that 
should  be  considered  non-equivalent  (Figure  4,  “Predefined  label  relationships”).  This 
functionality  also  overcame  problems  that  arose  from  labeling-scheme  variations  that  were 
not  always  recognized  by  computational  procedures  for  string  comparison.  For  example, 
the  labels  ttfmf>  and  “formyltransferase”  are  equivalent,  where  the  first  word  represents  an 
abbreviation  of  the  second.  However,  these  words  are  neither  the  same  nor  contained 
within  each  other,  and  their  equivalence  cannot  be  detected  solely  by  string  comparison.  A 
computational  procedure  that  could  detect  such  equivalence  based  on  string  manipulation 
would  yield  many  false  positives  and  was  not  pursued. 

Step  III  -  Creating  a  final  annotation  set 

In  the  third  step,  all  pairs  of  domains  from  all  databases  in  the  second  step  were  merged 
into  a  final  annotation  set.  The  merging  procedure  was  similar  to  the  one  from  Step  II. 
Here,  however,  each  domain  annotation  set  already  contained  some  merged  domains  from 
Step  II.  Therefore,  for  merged  domains,  the  representative  sequence  was  the  sequence 
derived  in  Step  II  and  the  representative  label  was  a  multi-label  annotation,  whereas,  for 
domains  that  had  not  been  merged,  the  original  label  was  used  as  their  annotation.  For 
each  protein,  we  grouped  the  domain  annotations  into  sets  such  that  domains  within  the 
same  set  had  equivalent  domain  labels  and  overlapped  with  the  same  sequence  locations. 
Then,  we  merged  the  domain  annotations  within  each  set  into  a  single  domain.  Finally,  we 
assigned  new  domain  labels  to  each  set  of  merged  domains  (Figure  4,  Step  III). 

By  merging  all  joined  pairs  from  Step  II,  it  was  possible  to  detect  additional  overlap 
between  domain  labels  that  were  not  detected  in  Step  II.  For  example,  given  three  domains 
with  labels  “ abc-smc5 ,”  “ abc-atpase ,”  and  “ smc ”  from  three  different  databases,  the 
computational  procedure  in  Step  II  would  identify  the  domain  labels  “ abc-smc5 ”  and 
“ abc-atpase ”  as  equivalent,  the  labels  “ abc-smcS ”  and  “me”  as  equivalent,  and  the  labels 
“abc-atpase”  and  “smc”  as  not  equivalent.  Only  the  first  and  second  pairs  of  domains 


would  therefore  be  merged,  assuming  that  all  three  domains  covered  approximately  the 
same  sequence  stretch.  However,  in  Step  III,  the  computational  procedure  would 
determine  that  the  merged  domain  “ abc-atpase  abc-smc5 ”  was  equivalent  to  the  merged 
domain  “ abc-smc5  smc ,”  and  that  the  “ smc ”  and  “ abc-atpase ”  domains  would  thus  be 
identified  as  equivalent,  even  though  this  was  missed  in  Step  II  (Figure  4). 

For  all  string  (word)  comparison  procedures,  we  used  string  comparison  algorithms  available 
in  a  standard  C++  library. 

For  each  protein  of  interest,  our  method  outputs  the  newly  assigned  domain  labels  and  their 
corresponding  sequence  locations.  Additionally,  the  procedure  provides  a  list  (dictionary)  that 
contains  mappings  between  the  new  domain  labels  and  labels  from  the  original  databases,  as 
well  as  a  list  of  domain  labels  that  overlapped  in  sequence  but  were  not  similar  enough  to  be 
merged.  These  lists  can  be  used  to  redefine  a  set  of  labels  that  should  be  treated  as  the 
equivalent  or  different  (Figure  4). 

Definition  of  true  and  false  positive/negative  predicted  PPIs 

In  this  work,  we  have  adapted  an  operational  definition  of  true  and  false  PPI  predictions 
based  on  what  is  known  about  a  given  protein  interactions  network.  Given  a  set  of  n  proteins 
and  m  known,  experimentally  detected  pairwise  interactions  among  these  proteins  (the 
interacting  set),  we  defined  the  set  of  non-interacting  protein  pairs  as  the  set  that  includes  all 
pairwise  PPIs  among  the  n  proteins,  except  for  the  known  interactions.  Hence,  the  number  of 

non-interacting  PPIs  is  given  by  Qj  —  m  [18,20,30].  We  then  defined  a  true  positive  (TP) 

PPI  prediction  as  a  predicted  PPI  that  belongs  to  the  interacting  set.  Similarly,  a  false  positive 
(FP)  PPI  is  defined  as  a  predicted  PPI  that  belongs  to  the  non-interacting  set.  A  true  negative 
(TN)  PPI  prediction  is  defined  as  a  predicted  non-interacting  protein  pair  that  belongs  to  the 
non-interacting  set.  A  false  negative  (FN)  PPI  prediction  is  defined  as  a  predicted  non¬ 
interacting  protein  pair  that  belongs  to  the  interacting  set.  The  true  positive  rate  is  then 
defined  as  TP/(TP  +  FN)  and  the  false  positive  rate  as  FP/(FP  +  TN). 

Parameter-dependent  DDI  selection  (PADDS)  algorithm 

PADDS  was  designed  to  select  sets  of  DDIs  that  can  reconstitute  a  given  protein  interaction 
network.  Specifically,  for  each  potential  DDI  and  its  corresponding  PPIs,  we  assessed  the 
consequences  of  selecting  that  particular  DDI  versus  each  one  of  the  other  possible  DDIs  that 
account  for  the  same  PPIs  (we  denoted  these  DDIs  as  alternative  DDIs).  Instead  of  exploring 
all  possible  combinations  of  alternative  DDIs,  PADDS  explores  only  a  subset  of 
enumerations  consisting  of  currently  evaluated  DDIs  and  their  best  alternatives,  i. e. , 
alternatives  that  best  satisfy  the  evaluation  criteria.  If  the  evaluated  DDI  was  better  than  any 
alternative,  it  was  selected  as  a  PPI  mediator  and  assigned  a  benefit  score.  Already  selected 
DDIs,  as  well  as  the  PPIs  they  accounted  for,  were  never  re-evaluated,  further  limiting  the 
number  of  combinations  to  be  enumerated.  The  final  constructed  set  of  domain  interactions 
represented  a  minimal  DDI  set  that  accounted  for  all  PPIs,  while  attempting  to  minimize  false 
positives. 

In  contrast  to  existing  reconstitution  methods,  PADDS  does  not  a  priori  reward  or  penalize 
the  most  rare,  promiscuous,  or  parsimonious  set  of  interactions.  Instead,  it  biases  the  benefit 
of  each  selected  domain  interaction  towards  either  preferring  observed  PPIs  (true  positives) 


or  penalizing  non-observed  PPIs  (here  categorized  as  false  positives).  Thus,  depending  on  the 
value  of  the  parameter  that  specifies  the  true/false  positive  biases  (denoted  as  a;  a  £  [0.0, 
1.0]),  PADDS  extracts  multiple  sets  of  potential  DDIs  that  can  explain  the  original  set  of 
PPIs.  An  a  of  0.0  favors  observed  protein  interactions  and  an  a  of  1.0  maximally  penalizes 
non-observed  interactions.  In  addition,  PADDS  also  identifies  a  set  of  robust,  core  DDIs  that 
are  independent  of  the  parameter  a. 


Algorithm  and  implementation  details 


Let  Oij  denote  the  number  of  observed  interacting  protein  pairs,  where  one  protein  contains 
domain  i  and  the  other  contains  domain  j,  and  let  Ny  denote  the  number  of  all  possible  non¬ 
interacting  protein  pairs,  where  one  protein  contains  domain  i  and  the  other  contains  domain 
j.  The  association  score  Aij  [28],  which  represents  the  probability  of  interaction  between 
domains  i  and  j,  is  defined  as: 


An  = 


Oij+Nij 


■(1) 


Let  a  denote  a  parameter  with  a  value  in  the  [0.0,  1.0 ]  range  that  specifies  the  amount  of 
tolerable  non-interacting  protein  pairs.  We  evaluated  the  probability  of  the  occurrence  of  a 
domain  pair  i  and  j  in  a  set  of  PPIs  as  a  modified  association  score  A"1  : 


a. 


O,  ,  +  OC  A  ;  : 


(2) 


For  a  =  0.0,  A™  equals  the  number  of  observed  DDIs,  i.e.,  the  number  of  PPIs  in  which  one 
protein  contains  domain  i  and  the  other  contains  domain  j,  whereas  for  a  =  1.0,  A™  denotes 
the  probability  of  interaction  between  domains  i  and  j  [defined  in  Equation  (1)]  multiplied  by 
the  number  of  domain  interaction  occurrences.  Thus,  for  a  =  0.0,  the  modified  association 
score  corresponds  to  domain  interactions  that  explain  the  largest  number  of  PPIs,  while  for  a 
=  1.0,  the  score  corresponds  to  domain  interactions  that  do  not  introduce  large  number  of 
false  positive  PPIs.  We  multiplied  the  probability  of  interaction  by  Oy  to  differentiate 
between  DDIs  that  have  the  same  probability  by  assigning  a  higher  score  to  those  DDIs  that 
account  for  a  larger  number  of  PPIs. 

Benefit  definition 


The  benefit  of  interaction  between  two  domains  i  and  j  represents  the  propensity  that  these 
two  domains  mediate  protein  interactions.  We  defined  the  benefit  by  combining  the  above 
modified  association  score  with  a  term  that  takes  into  account  the  co-occurrence  of  domains, 
because  domains  that  appear  together  within  a  protein  often  interact  [46,47].  Let  Cy  denote 
the  number  of  proteins  in  which  domains  i  and  j  co-occur,  and  let  maxCy  denote  the 
maximum  number  of  co-occurring  domains  observed  in  a  given  set  of  proteins.  We  then 
defined  the  benefit  By  of  the  interaction  between  two  domains  i  and  j  as: 


Using  different  values  of  a,  one  can  rank  the  same  set  of  DDIs  differently  based  on  their  Blf 
value. 


Iterative  evaluation  of  selected  DDIs 

PADDS  goal  is  to  extract  a  set  of  DDIs  such  that:  1)  these  DDIs  account  for  a  given  set  of 
PPIs  and  2)  the  sum  of  benefits  of  this  set  is  higher  than  the  sum  of  benefits  of  any  alternative 
DDI  set  of  the  same  size  that  explains  the  same  PPIs.  The  optimal  solution  for  this  problem 
would  require  the  exhaustive  enumeration  of  all  possible  combinations  of  DDIs  that  account 
for  the  original  set  of  PPIs.  Because,  in  practice,  the  exhaustive  enumeration  is 
computationally  unfeasible,  PADDS  uses  a  heuristic  solution.  Given  a  set  of  PPIs,  a  list  of 
protein-domain  annotations,  and  a  user-specified  parameter  a,  PADDS  calculates  B,j  for  each 
potential  DDI  (Figure  5-1).  For  each  DDI  and  the  corresponding  PPIs  represented  by  this 
DDI,  PADDS  evaluates  the  consequences  of  selecting  this  DDI  versus  each  of  its  alternative 
DDIs.  This  evaluation  yields  a  small  set  of  DDIs,  called  the  final  set. 


Figure  5  Example  of  domain-domain  interaction  extraction.  I:  Given  a  set  of  protein- 
protein  interactions  (PPIs)  and  a  protein-domain  annotation  scheme,  PADDS  transformed  all 
PPIs  into  the  corresponding  set  of  domain-domain  interactions  (DDIs)  and  calculated  the 
benefit  value  B,/  for  all  DDIs.  II:  The  five  steps  involved  in  the  DDI  iterative  evaluation 
procedure  is  illustrated  using  interactions  between  domains  D1  and  D3.  Ill:  After  PADDS 
performed  the  DDI  evaluation  procedure  for  all  other  DDIs,  the  results  were  examined  to 
select  the  final  set  of  DDIs  that  can  reconstitute  the  PPIs.  PI,  . . .,  P7  denote  proteins  and  Dl, 

. . .,  D8  denote  domains.  The  benefit  By  and  the  reassessed  benefit  By  associated  with  the 
interaction  between  domains  ij  were  calculated  using  Equations  (3)  and  (4),  respectively. 


The  evaluation  process  consists  of  five  phases.  In  the  first  phase,  PADDS  adds  a  DDI  of 
interest  into  a  set  called  the  main  set.  In  addition,  PADDS  evaluates  all  alternative  DDIs,  i.e., 
DDIs  that  overlap  with  the  DDI  of  interest,  and  adds  the  one  with  the  highest  benefit  to  a  set 
called  the  alternative  set.  Note  that  the  main  and  alternative  sets  are  initially  empty  (Figure  5 
-  II,  Step:  0  and  Step:  1).  In  the  second  phase,  PADDS  evaluates  all  DDIs  that  overlap  with 
the  DDIs  from  the  alternative  set  (Figure  5  -  II,  Step:  2  to  Step:  4).  PADDS  always  evaluates 
only  the  DDIs  that  are  not  already  contained  in  the  alternative,  main,  or  final  sets.  However, 
in  the  evaluation  process,  PADDS  takes  into  consideration  that  DDIs  from  the  final  set 
already  account  for  a  particular  subset  of  PPIs.  Because  these  PPIs  should  not  be  used  for  the 
evaluation  of  potential  DDIs,  for  all  potential  DDIs  that  are  not  in  the  final  set,  PADDS 
calculates  the  reassessed  benefits  B\ ■  as: 


Gj  Uj 


Bjj  s  •  Eij, 
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where  £)/  represents  the  number  of  observed  interacting  protein  pairs  containing  domains  i 
and  j  that  have  already  been  accounted  for  by  some  other  DDI  (either  from  the 
main/alternative  set  or  the  final  set),  and  5  represents  a  scaling  factor  used  to  additionally 
reduce  the  benefit  value  of  DDIs.  We  empirically  selected  5  =  0.01.  Out  of  all  evaluated 
DDIs,  PADDS  finds  a  DDI  with  the  highest  reassessed  benefit  and  adds  it  to  the  main  set.  In 
the  second  phase,  the  algorithm  iterates  between  the  alternative  set  and  the  main  set  until  all 
PPIs  that  are  explained  by  potential  DDIs  in  one  set  are  also  explained  by  potential  DDIs 


from  the  other  set.  In  phase  three,  PADDS  calculates  the  total  accumulative  benefit  5[j0t  for 
each  set  as: 

t=o°ri  ® 

m,n  m,n 


where  k  =  1  for  DDIs  that  were  added  into  a  set  based  on  their  original  benefit  value  Bmn,  and 
k  =  0  for  DDIs  that  were  added  into  a  set  based  on  their  reassessed  benefit  value  6„in  (Figure 
5  -  II,  Step:  5).  In  the  fourth  phase,  PADDS  compares  the  5[j0t  value  of  the  main  and 
alternative  sets.  If  the  main  set  has  the  greatest  Bj:°t  value,  PADDS  flags  the  DDI  of  interest 
and  assigns  it  this  value.  Then,  in  phase  five,  for  all  flagged  DDIs,  PADDS  finds  the  one  with 
the  highest  5[j0t  value  and  adds  this  DDI  to  the  final  set  of  DDIs  (Figure  5  -  II,  Step:  5).  In 
the  case  where  no  DDI  is  flagged,  i.e.,  all  alternative  sets  have  higher  values  than  their 
corresponding  main  set  counterparts,  PADDS  assigns  to  each  DDI  a  value  equal  to  the  ratio 
of  By jot  of  the  main  set  and  Bit°t  of  the  alternative  set.  Then,  PADDS  adds  DDI  with  the 
highest  ratio  value  to  th e,  final  set  (Figure  5  -  III).  This  evaluation  procedure  is  repeated  until 
all  given  PPIs  are  explained  by  DDIs  from  th e  final  set.  Extracted  sets  of  potential  DDIs  that 
are  common  to  all  values  of  a  are  denoted  as  the  core  set. 

Ties  between  two  DDIs  are  broken  in  the  following  order:  1)  minimum  Ny,  2)  maximum  Oy, 
3 )  maximum  number  of  times  the  interaction  between  domains  i  and  j  explains  a  single  PPI 
multiple  times  ( e.g .,  if  both  proteins  contain  domains  i  and  j,  then  that  PPI  can  be  explained 
by  two  i  -  j  domain  interactions),  4)  maximum  Cy,  5)  maximum  number  of  unique  PPIs  that  a 
DDI  explains,  and  6)  maximum  benefit.  In  cases  where  ties  are  not  broken  after  this 
procedure,  they  are  broken  randomly. 

Data  and  implementation  of  other  reconstitution  methods:  GPE  and  MSSC 

For  the  comparison  with  the  GPE  method  [21]  on  the  dataset  from  Riley  et  al.  [27],  we  used 
two  sets  of  published  results;  the  first  contained  the  top  1,399  high-confidence  DDIs  (denoted 
“ GPE-HC ”)  and  the  second  contained  7,554  DDIs  of  lower-confidence  (denoted  “GPE-LC') 
that  were  not  necessarily  included  in  the  first  set.  In  the  comparisons,  we  used  the  DDI  rank 
information  provided  with  the  published  data  [21]. 

For  the  yeast  high-confidence  dataset  comparison,  we  used  the  MSSC  program  available 
from  the  authors’  Web  site  to  extract  and  rank  DDIs  using  the  association  score  [28].  We 
implemented  the  GPE  algorithm  in  MATLAB  using  the  parameter  values  specified  by  the 
authors  and  ranked  DDIs  using  the  LP- score,  following  the  methodology  detailed  in  the 
original  manuscript  [21]. 

Endnotes 

aThis  set  of  proteins  contains  the  translations  of  all  systematically  named  ORFs,  except  ORFs 
designated  as  “dubious”  or  “pseudogenes.” 

bThe  remaining  domain  annotations  did  not  change,  had  minor  modifications  compared  to  the 
previous  version,  had  domains  of  unknown  function,  had  domains  assigned  to  previously 
unannotated  proteins,  or  had  domains  assigned  to  previously  unannotated  sequence  segments. 
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AdditionaI_fiIe_l  as  ZIP 

Additional  file  1  Merged  annotation.  This  file  contains  lists  of  new  protein-domain 
annotations  and  lists  that  map  new  domain  labels  onto  the  domain  labels  used  in  the  original 
annotation  databases. 

Additional_file_2  as  XLSX 

Additional  file  2  Merged  protein-domain  annotations  that  correspond  to  domain  annotations 
from  the  new  PFAM-A  release.  This  file  contains  a  list  of  202  merged  protein-domain 
annotations  corresponding  to  domain  annotations  from  the  current  PFAM-A  release  (release 
26.0)  that  did  not  exist  in  the  previous  PFAM-A  release  (release  25.0),  but  had  a  matching 
domain  annotation  in  our  merged  set.  Additionally,  this  file  contains  the  merged  protein- 
domain  annotations  corresponding  to  the  current  PFAM-A  domain  annotations  that  replaced 
the  annotations  from  the  previous  PFAM  release. 

Additional_file_3  as  XLSX 

Additional  file  3  Domain-domain  interactions  extracted  by  the  parameter-dependent  DDI 
selection  (PADDS)  method  -  multiple  organisms.  This  file  contains  lists  of  domain-domain 
interactions  extracted  by  PADDS  for  the  Riley  dataset  [27]  for  all  values  of  the  parameter  a 
analyzed  in  the  main  text. 

Additional_file_4  as  PDF 

Additional  file  4  Supplementary  material.  This  file  provides  the  supplemental  text, 
supplemental  Figures  SI  -  S4,  and  supplemental  Tables  SI  -  S4. 


A  InterPro  output: 


YNL271C  ♦ 

1 

PFAM-A  - 

Superfamily  - 


Drf  GBD  Drf  FH3 


ARM  repeat 


SMART 


1953 


FH2 


Fonniii  homology  2  domain  (FH2  domain) 


Fonnin  Homology 


B 

C 

D 


Possible  merged  annotations 


Naive 


D1D2  D3 


D5 


1 


Sequence 

based 

Sequence  & 
labels/names 


Figure  1  1)1,  D2,  D7  denote  domain  names 


Precision  Known  DDIs  retrieved  (%)  Known  DDIs  retrieved  (%) 


0.0  0.2  0.4  0.6  0.8  1.0 


a 


Figure  2 


DDI  set  size  (103) 


80 

60 

40 

20 

0 


DDIs  represented  by  a  single  value  of  a 
DDIs  represented  by  all  values  of  a 


Annotation  sources 


ITERATION 


INPUT 


Sequence  information 


Protein  P  (sequence  length  «) 


MMTA...SDLLTLPLLAQYSSSA...PQNVFY 


Protein-domain  annotation 

Protein  P,  databases:  DB1,  DB2,  DB3 


DB1 1  INI  II' 


wd40  I  abc-atpase  C  abc 


DB2C 


□  wd-40  ■  abc-smc5  ■  fmt 


DB3I 


I  kinase  □  smc 
d  formyltransferase 


Predefined  label  relationships 


fmt  =  formyltransferase 
kinase  ^  not-ldnase 


PROCESSING 


If 


Step  I:  domain  repeats  merging 


DB1  [ 
DB2  £ 
DB3 


Step  II:  pairs  of  databases  merging 
DB1-DB1  I  I  I  - 


DB1  - DB2  [ 
DB1  -DB3 
DB2  -  DB3 


Step  III:  final  annotation  set 
DB1-DB2-DB3I 


m 


Equality  between  domain  labels 


□ 

1 

□ 

□ 

1 

■ 

□ 

□ 

■ 

□ 

Y 

Y 

■ 

Y 

Y 

Y 

□ 

Y 

Y 

Y 

□ 

Y 

Y 

□ 

Y 

Y 

Y 

Y 

1 

Y 

E 

i 

Y 

□ 

Y 

Y 

□ 

E 

Y 

Y  —  equality  based  on  the  embedded 
label  comparison  procedure 
E  -  equality  based  on  predefined 
(input)  label  equivalence 


OUTPUT 


IE 


New-old  domain  label  map 

Dl:  wd40,  wd-40 
D2:  pkinase 

D3:  abc-atpase,  abc,  abc-smc5,  smc 
D4:  fmt,  formyltransferase 


Merged  domain  annotation 


P:  D1,D2,D3,  D4 


Merging  exceptions 

wd40  ^  pkinase 
wd-40  t  pkinase 


|  Overlapping,  but  not  merged  domains  I  I  Overlapping  and  merged  domains  ---  Optional  (exception  items) 


I 


Protein-protein 

interactions: 

Protein-domain 

annotations: 

P1-P2 

PI  :  Dl,  D2 

PI-P3 

P2  :  D3,  D4 

P1-P4 

P3  :  D4,  D5 

P2-P3 

P4  :  D4 

P4-P5 

P5  :  Dl,  D6 

P4-P6 

P6  :  D3,D7 

P4-P7 

P7  :  D8 

DDI 

Bjj  (for  a  =  0.0) 

D1-D3 

1  +  0.0*4  +  1  =  ,0° 

D1-D4 

4.00 

D1-D5 

1.00 

D2-D3 

1.00 

D2-D4 

3.00 

D2-D5 

1.00 

D3-D4 

3.00 

D3-D5 

1.00 

D4-D4 

1.00 

D4-D5 

2.00 

D4-D6 

1.00 

D4-D7 

1.00 

D4-D8 

1.00 

II 


Step:  0 


Evaluation  procedure  for  domain-domain  interacting  pair  D1-D3: 

v:. 


Step:  1 


Step:  2 


Step:  3 


Step:  4 


Step:  5 


MAIN 

VI 

PPIs:  ^ 
0 

1 

> 

ALTERNATIVE 

MAIN 

VJ 

PPIs:  ^ 

1 

► 

*-• 

I 

ALTERNATIVE 

Dl  D3  (4.00) 

PPIs: 

0 


V 


PPIs: 

0 


DDI 

B« 

D1-D4 

4.00 

D2-D4 

3.00 

D2-D3 

1.00 

MAIN 

\ 

D1-D3  (1.00) 

PPIs: 

1 

Is: 


MAIN 

< 

I 

ALTERNATIVE 

D1-D3  (1.00) 

D1-D4  (4.00) 

D2-D4  (1.99) 

PPIs:  ^ 

► 

3 

MAIN 

yi , 

ALTERNATIVE 

D1-D3  (1.00) 
D2-D4  (1.99) 
D4-D6  (1.00) 

. 

PPIs: 

4 

D1-D4  (4.00) 

VI. 


’Pis: 


D2-D3 

D2-D5 


MAIN 

•  •  f 

ALTERNATIVE 

m  t 

By 

B,' 

v  I  _ 

V  1  ^  D4-D6 

1.00 

D1-D3  (1.00) 

D1-D4  (4.00) 

•  T  •  D1-D5 

-0.01 

D2-D4  (1-99) 

X  D2-D3 

l.M 

-0.01 

PPIs:  w 

PPIS:  *  D2-D5 

1  <JU 

-0.01 

3 

4 

VI. 


PPIs: 


B^Vmain)  -  1  +  1.99  +  1  -  3.99 

Dt0t 

Bi(  (ALTERNATnL)  =4 


By  (MAES')  <  By  (ALTERNATIVE) 

Flag  =  0 

Score  =  ^*22.  =  0.99 
4 


1st  iteration: 

FINAL  SET  =  0 

Initial  step  of  the  evaluation  process: 
MAIN  =  0 
ALTERNATIVE  =  0 


ALTERNATIVE 

DDI 

»ij 

B«r 

i— m  D2-D4 

3-iX 

D1-D4  (4.00) 

D2-D3 

1.00 

3 

PPIs:  t 

Dl  D5 

1.00 

D2-D5 

1.00 

D4-D6 

1.00 

-D1-D3  is  evaluated  ^  MAIN  =  {D1-D3} 
-Alternatives  to  D1-D3: 

D1-D4  (4.00),  D2-D3  (1.00),  D2-D4  (3.00) 

-Best  alternative  =  D1-D4  (4.00) 

l 

ALTERNATIVE  =  (1)104; 

-Alternatives  to  D1-D4: 

D2-D3,  D2-D4,  D1-D5,  D2-D5,  D4-D6 
-D2-D4  shares  PPI  with  D1-D3  (P1-P2) 

-  benefit  of  D2-D4  reduced 

-Best  alternative  =  D2-D4  (1.99) 

MAIN  =  {D1-D3,  D2-D4} 


-Alternatives  to  D1-D3  and  D2-D4: 

D2-D3,  D2-D5 

-D2-D3  shares  PPI  with  D1-D4  (P1-P2) 
-D2-D5  shares  PPI  with  D1-D4  (P1-P3) 

-New  benefits  <  0  ^  Total  number  of  PPIs: 
3  <  4  ^  Proceed  to  the  next  step 

-Alternatives  to  D1-D4: 

D2-D3,  D1-D5,  D2-D5,  D4-D6 
-D1-D5,  D2-D3,  and  D2-D5  shares  PPI  with 
D2-D4  (P1-P3  and  P1-P2) 

-Best  alternative  =  D4-D6  (1.00) 

MAIN  =  {D1-D3,  D2-D4,  D4-D6} 


-DDIs  from  MAIN  account  for  the  same 
PPIs  as  DDIs  from  ALTERNATIVE 
-Calculate  total  benefits  for  MAIN  and 
ALTERNATIVE 

set  flag  to  0, 
set  score  to  0.99 


By  (MAIN)  <  Bjj  (ALT.) 
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